Pegasus Tutorial¶

Authors: Yiming Yang, Rimte Rocher
Date: 2022-02-24
Notebook Source: pegasus_analysis.ipynb

In [1]:
import pegasus as pg

Count Matrix File¶

For this tutorial, we provide a count matrix dataset of Human Bone Marrow samples from 8 donors, stored in Zarr format (with file extension ".zarr.zip").

You can download the data at https://storage.googleapis.com/terra-featured-workspaces/Cumulus/MantonBM_nonmix_subset.zarr.zip.

This file was obtained by aggregating the gene-count matrices of the 8 10X channels using PegasusIO, and further filtering out cells with fewer than $100$ genes expressed. Please see here for how to do it interactively.

Now load the file using the Pegasus read_input function:

In [2]:
data = pg.read_input("MantonBM_nonmix_subset.zarr.zip")
data
2022-02-24 14:39:59,474 - pegasusio.readwrite - INFO - zarr file 'MantonBM_nonmix_subset.zarr.zip' is loaded.
2022-02-24 14:39:59,475 - pegasusio.readwrite - INFO - Function 'read_input' finished in 0.33s.
Out[2]:
MultimodalData object with 1 UnimodalData: 'GRCh38-rna'
    It currently binds to UnimodalData object GRCh38-rna

UnimodalData object with n_obs x n_vars = 48219 x 36601
    Genome: GRCh38; Modality: rna
    It contains 1 matrix: 'X'
    It currently binds to matrix 'X' as X

    obs: 'n_genes', 'Channel', 'gender'
    var: 'featureid'
    uns: 'genome', 'modality'

The count matrix is managed as a UnimodalData object defined in the PegasusIO module. Users manipulate the data from the top level via the MultimodalData structure, which can contain multiple UnimodalData objects as members.

For this example, as shown above, data is a MultimodalData object with only one UnimodalData member, keyed "GRCh38-rna", which is its default UnimodalData. Any operation on data is applied to this default UnimodalData object.

A UnimodalData object has 6 major parts:

  • Raw count matrix: data.X, a Scipy sparse matrix (occasionally a dense matrix), with rows being cell barcodes and columns being genes/features:
In [3]:
data.X
Out[3]:
<48219x36601 sparse matrix of type '<class 'numpy.int32'>'
	with 39997174 stored elements in Compressed Sparse Row format>

This dataset contains $48,219$ barcodes and $36,601$ genes.

  • Cell barcode attributes: data.obs, a Pandas data frame with barcodes as the index. It currently has three attributes: "n_genes" (number of genes expressed), "Channel" (the donor the cell comes from), and "gender":
In [4]:
data.obs.head()
Out[4]:
n_genes Channel gender
barcodekey
MantonBM1_HiSeq_1-AAACCTGAGCAGGTCA 816 MantonBM1_HiSeq_1 female
MantonBM1_HiSeq_1-AAACCTGCACACTGCG 716 MantonBM1_HiSeq_1 female
MantonBM1_HiSeq_1-AAACCTGCACCGGAAA 554 MantonBM1_HiSeq_1 female
MantonBM1_HiSeq_1-AAACCTGCATAGACTC 967 MantonBM1_HiSeq_1 female
MantonBM1_HiSeq_1-AAACCTGCATCGATGT 1704 MantonBM1_HiSeq_1 female
In [5]:
data.obs['Channel'].value_counts()
Out[5]:
MantonBM6_HiSeq_1    6748
MantonBM8_HiSeq_1    6092
MantonBM4_HiSeq_1    6068
MantonBM7_HiSeq_1    6025
MantonBM5_HiSeq_1    5963
MantonBM2_HiSeq_1    5930
MantonBM1_HiSeq_1    5837
MantonBM3_HiSeq_1    5556
Name: Channel, dtype: int64
  • Gene attributes: data.var, also a Pandas data frame, with gene names as the index. For now, it has only one attribute, "featureid", referring to the unique gene ID in the experiment:
In [6]:
data.var.head()
Out[6]:
featureid
featurekey
MIR1302-2HG ENSG00000243485
FAM138A ENSG00000237613
OR4F5 ENSG00000186092
AL627309.1 ENSG00000238009
AL627309.3 ENSG00000239945
  • Unstructured information: data.uns, a Python dictionary. It usually stores information about the whole dataset rather than individual barcodes or features, such as its genome reference and modality type:
In [7]:
data.uns['genome']
Out[7]:
'GRCh38'
In [8]:
data.uns['modality']
Out[8]:
'rna'
  • Finally, embedding attributes: data.obsm on cell barcodes, and data.varm on genes. We'll see them in later sections.

Preprocessing¶

Filtration¶

The first step in preprocessing is quality control (QC): removing cells and genes of low quality.

We can generate QC metrics using the following method with default settings:

In [9]:
pg.qc_metrics(data, min_genes=500, max_genes=6000, mito_prefix='MT-', percent_mito=10)

The metrics considered are:

  • Number of genes: keep cells with $500 \leq \text{# genes} < 6000$. (Notice that starting from v1.4.0, Pegasus doesn't filter cells by gene-count thresholds by default.)
  • Number of UMIs: don't filter cells by UMI bounds (default);
  • Percent of mitochondrial counts: look for mitochondrial genes with name prefix MT- (notice that starting from v1.4.0, you'll need to specify this prefix manually), and keep cells with percent mito $< 10\%$.

For details on customizing your own thresholds, see documentation.
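Conceptually, the thresholds above amount to a per-cell predicate. Below is a rough sketch of that logic with a hypothetical helper (`passes_qc`); Pegasus's actual implementation also records per-channel statistics and annotates data.obs instead of filtering directly:

```python
# Hypothetical helper mirroring the QC thresholds described above;
# not Pegasus's internal implementation.
def passes_qc(n_genes, percent_mito,
              min_genes=500, max_genes=6000, max_percent_mito=10.0):
    """Keep a cell iff min_genes <= n_genes < max_genes and percent_mito < max_percent_mito."""
    return (min_genes <= n_genes < max_genes) and (percent_mito < max_percent_mito)

# Toy per-cell stats for illustration:
cells = [
    {"n_genes": 816,  "percent_mito": 3.1},   # kept
    {"n_genes": 120,  "percent_mito": 2.0},   # too few genes expressed
    {"n_genes": 2500, "percent_mito": 15.0},  # too much mitochondrial content
]
kept = [c for c in cells if passes_qc(c["n_genes"], c["percent_mito"])]
print(len(kept))  # 1
```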

Numeric summaries on filtration on cell barcodes and genes can be achieved by get_filter_stats method:

In [10]:
df_qc = pg.get_filter_stats(data)
df_qc
Out[10]:
kept median_n_genes median_n_umis median_percent_mito filt total median_n_genes_before median_n_umis_before median_percent_mito_before
Channel
MantonBM5_HiSeq_1 4090 770.0 2795.0 3.136190 1873 5963 650.0 2139.0 3.399615
MantonBM4_HiSeq_1 4172 790.0 2278.5 3.271181 1896 6068 672.0 1764.0 3.519009
MantonBM3_HiSeq_1 4225 779.0 2621.0 3.274451 1331 5556 715.0 2229.0 3.449398
MantonBM1_HiSeq_1 4415 790.0 2533.0 3.713331 1422 5837 723.0 2149.0 3.855422
MantonBM7_HiSeq_1 4452 745.0 2403.5 3.053718 1573 6025 679.0 2053.0 3.177570
MantonBM8_HiSeq_1 4511 735.0 2561.0 3.520510 1581 6092 671.5 2212.0 3.706849
MantonBM6_HiSeq_1 4665 852.0 2700.0 3.032258 2083 6748 741.0 2129.0 3.345829
MantonBM2_HiSeq_1 4935 801.0 2486.0 3.514056 995 5930 756.0 2261.5 3.534756

The result is a Pandas data frame indexed by sample.

You can also check the QC stats via plots. Below is on number of genes:

In [11]:
pg.qcviolin(data, plot_type='gene', dpi=100)

Then on number of UMIs:

In [12]:
pg.qcviolin(data, plot_type='count', dpi=100)

Then on percentage of mitochondrial counts:

In [13]:
pg.qcviolin(data, plot_type='mito', dpi=100)

Now filter cells based on QC metrics set in qc_metrics:

In [14]:
pg.filter_data(data)
2022-02-24 14:40:01,496 - pegasusio.qc_utils - INFO - After filtration, 35465 out of 48219 cell barcodes are kept in UnimodalData object GRCh38-rna.
2022-02-24 14:40:01,497 - pegasus.tools.preprocessing - INFO - Function 'filter_data' finished in 0.23s.

You can see that $35,465$ cells ($73.55\%$) are kept.

Moreover, for genes, only those expressed in no cells are removed at this point. After that, we identify robust genes for downstream analysis:

In [15]:
pg.identify_robust_genes(data)
2022-02-24 14:40:02,097 - pegasus.tools.preprocessing - INFO - After filtration, 25653/36601 genes are kept. Among 25653 genes, 17516 genes are robust.
2022-02-24 14:40:02,098 - pegasus.tools.preprocessing - INFO - Function 'identify_robust_genes' finished in 0.60s.

The criterion is the following:

  • A gene is robust if it is expressed in at least $0.05\%$ of cells, i.e. among every 6000 cells, at least 3 cells express this gene.

Please see its documentation for details.
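The criterion above can be sketched as a one-line predicate (a hypothetical helper for illustration, not the actual Pegasus code):

```python
# Keep a gene as robust iff it is expressed in at least percent_cells% of cells.
def is_robust(n_cells_expressing, n_cells_total, percent_cells=0.05):
    return n_cells_expressing >= n_cells_total * percent_cells / 100.0

# Among 6000 cells, a gene needs at least 3 expressing cells:
print(is_robust(3, 6000))  # True
print(is_robust(2, 6000))  # False
```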

As a result, $25,653$ ($70.09\%$) genes are kept. Among them, $17,516$ are robust.

We can now view the cells of each sample after filtration:

In [16]:
data.obs['Channel'].value_counts()
Out[16]:
MantonBM2_HiSeq_1    4935
MantonBM6_HiSeq_1    4665
MantonBM8_HiSeq_1    4511
MantonBM7_HiSeq_1    4452
MantonBM1_HiSeq_1    4415
MantonBM3_HiSeq_1    4225
MantonBM4_HiSeq_1    4172
MantonBM5_HiSeq_1    4090
Name: Channel, dtype: int64

Normalization and Logarithmic Transformation¶

After filtration, we first normalize the counts of each cell so that they sum to the same value (default $10^5$, see documentation), and then transform into logarithmic space by $\log(x + 1)$ to dampen the effect of extreme counts:

In [17]:
pg.log_norm(data)
2022-02-24 14:40:02,788 - pegasus.tools.preprocessing - INFO - Function 'log_norm' finished in 0.67s.
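Conceptually, the transformation applied to each cell can be sketched as follows. This is a toy illustration of the idea behind pg.log_norm, assuming the default target sum of $10^5$, not its actual implementation:

```python
import math

# Scale a cell's counts to a common target sum, then apply log1p.
def log_norm_cell(counts, target_sum=1e5):
    total = sum(counts)
    return [math.log1p(c * target_sum / total) for c in counts]

cell = [2, 0, 8]             # raw counts for 3 genes in one cell
normed = log_norm_cell(cell)
# Inverting the log shows the counts were scaled to sum to 1e5:
scaled = [math.expm1(x) for x in normed]
print(round(sum(scaled)))  # 100000
```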

For the downstream analysis, we make a copy of the count matrix, in case we need to come back to this step and redo the analysis:

In [18]:
data_trial = data.copy()

Highly Variable Gene Selection¶

Highly Variable Genes (HVG) are more likely to convey information discriminating different cell types and states. Thus, rather than considering all genes, people usually focus on selected HVGs for downstream analyses.

In [19]:
pg.highly_variable_features(data_trial)
2022-02-24 14:40:03,219 - pegasus.tools.hvf_selection - INFO - Function 'estimate_feature_statistics' finished in 0.09s.
2022-02-24 14:40:03,262 - pegasus.tools.hvf_selection - INFO - 2000 highly variable features have been selected.
2022-02-24 14:40:03,263 - pegasus.tools.hvf_selection - INFO - Function 'highly_variable_features' finished in 0.14s.

By default, we select 2000 HVGs using the Pegasus selection method. Alternatively, you can choose the traditional method that both Seurat and SCANPY use by setting flavor='Seurat'. See documentation for details.

We can view HVGs by ranking them from top:

In [20]:
data_trial.var.loc[data_trial.var['highly_variable_features']].sort_values(by='hvf_rank')
Out[20]:
featureid n_cells percent_cells robust highly_variable_features mean var hvf_loess hvf_rank
featurekey
LYZ ENSG00000090382 8566 24.153391 True True 1.526394 8.110589 3.775874 3
S100A9 ENSG00000163220 8182 23.070633 True True 1.423049 7.649132 3.657402 5
S100A8 ENSG00000143546 7674 21.638235 True True 1.328664 7.228290 3.463029 7
HLA-DRA ENSG00000204287 14836 41.832793 True True 2.242150 7.513039 4.208062 12
GNLY ENSG00000115523 5196 14.651064 True True 0.882395 4.859677 2.504143 13
... ... ... ... ... ... ... ... ... ...
TPK1 ENSG00000196511 917 2.585648 True True 0.085252 0.289441 0.268601 5554
AL355916.1 ENSG00000232774 99 0.279148 True True 0.010097 0.037130 0.032563 5557
MEI1 ENSG00000167077 1813 5.112082 True True 0.178823 0.611417 0.574647 5559
AL035701.1 ENSG00000231769 222 0.625969 True True 0.021673 0.078620 0.070820 5563
KIAA1324 ENSG00000116299 155 0.437051 True True 0.015310 0.054929 0.048948 5564

2000 rows × 9 columns

We can also view HVGs in a scatterplot:

In [21]:
pg.hvfplot(data_trial, dpi=200)

In this plot, each point stands for one gene. Blue points are selected to be HVGs, which account for the majority of variation of the dataset.

Principal Component Analysis¶

To reduce the dimension of the data, Principal Component Analysis (PCA) is widely used. Briefly speaking, PCA transforms the data from the original dimensions into a much smaller set of Principal Components (PCs). In the transformed data, the dimension is reduced, while the PCs still capture the majority of the variation in the data. Moreover, the new dimensions (i.e. PCs) are uncorrelated with each other.

pegasus uses the following method to perform PCA:

In [22]:
pg.pca(data_trial)
2022-02-24 14:40:13,009 - pegasus.tools.preprocessing - INFO - Function 'pca' finished in 8.03s.

By default, pca uses:

  • Before PCA, scale the data to standard Normal distribution $N(0, 1)$, and truncate them with max value $10$;
  • Number of PCs to compute: 50;
  • Apply PCA only to highly variable features.

See its documentation for customization.

To explain the meaning of PCs, let's look at the first PC (denoted $PC_1$), which captures the most variation:

In [23]:
coord_pc1 = data_trial.uns['PCs'][:, 0]
coord_pc1
Out[23]:
array([ 0.02221761,  0.01772111, -0.00582949, ..., -0.00050337,
        0.04850996,  0.03549923], dtype=float32)

This is an array of 2000 elements, each of which is a coefficient corresponding to one HVG.

With the HVGs as the following:

In [24]:
data_trial.var.loc[data_trial.var['highly_variable_features']].index.values
Out[24]:
array(['HES4', 'ISG15', 'TNFRSF18', ..., 'RPS4Y2', 'MT-CO1', 'MT-CO3'],
      dtype=object)

$PC_1$ is computed by

\begin{equation*} PC_1 = \text{coord_pc1}[0] \cdot \text{HES4} + \text{coord_pc1}[1] \cdot \text{ISG15} + \text{coord_pc1}[2] \cdot \text{TNFRSF18} + \cdots + \text{coord_pc1}[1997] \cdot \text{RPS4Y2} + \text{coord_pc1}[1998] \cdot \text{MT-CO1} + \text{coord_pc1}[1999] \cdot \text{MT-CO3} \end{equation*}

Therefore, all the 50 PCs are the linear combinations of the 2000 HVGs.
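On a toy scale, the linear combination above is just a dot product between the loading vector and a cell's expression vector. Gene names and numbers here are made up for illustration:

```python
# One coefficient per "HVG" (toy loadings for 4 genes):
loadings_pc1 = [0.022, 0.018, -0.006, 0.035]
# A single cell's log-normalized expression of those same 4 genes:
cell_expression = [1.5, 0.0, 2.1, 0.7]

# The cell's PC1 coordinate is the weighted sum (dot product):
pc1_score = sum(w * x for w, x in zip(loadings_pc1, cell_expression))
print(round(pc1_score, 4))  # 0.0449
```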

The calculated PCA coordinates are stored in the obsm field, which is the first embedding object we have:

In [25]:
data_trial.obsm['X_pca'].shape
Out[25]:
(35465, 50)

For each of the $35,465$ cells, its representation is now given by 50 PCs instead of 2000 HVGs.

Nearest Neighbors¶

All downstream analyses, including clustering and visualization, require a k-Nearest-Neighbor (kNN) graph on cells. We can build such a graph using the neighbors method:

In [26]:
pg.neighbors(data_trial)
2022-02-24 14:40:24,187 - pegasus.tools.nearest_neighbors - INFO - Function 'get_neighbors' finished in 11.13s.
2022-02-24 14:40:25,520 - pegasus.tools.nearest_neighbors - INFO - Function 'calculate_affinity_matrix' finished in 1.33s.

It uses the default setting:

  • For each cell, calculate its 100 nearest neighbors;
  • Use PCA matrix for calculation;
  • Use L2 distance as the metric;
  • Use the hnswlib search algorithm to compute approximate nearest neighbors quickly.

See its documentation for customization.
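To make the defaults concrete, here is a brute-force version of the same computation on toy 2-D points: for each cell, find the k nearest other cells under L2 distance. Pegasus uses hnswlib for an approximate but much faster search; this sketch is exact and only for illustration:

```python
# Exact kNN under L2 (Euclidean) distance, excluding each point itself.
def knn_l2(points, k):
    result = []
    for i, p in enumerate(points):
        dists = sorted(
            (sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5, j)
            for j, q in enumerate(points) if j != i
        )
        result.append([j for _, j in dists[:k]])
    return result

points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0)]
print(knn_l2(points, 2))  # [[1, 2], [0, 2], [0, 1], [1, 2]]
```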

Below is the result:

In [27]:
print(f"Get {data_trial.obsm['pca_knn_indices'].shape[1]} nearest neighbors (excluding itself) for each cell.")
data_trial.obsm['pca_knn_indices']
Get 99 nearest neighbors (excluding itself) for each cell.
Out[27]:
array([[30526, 27514, 28825, ..., 18019, 22273, 32915],
       [29723,  2651,  5282, ..., 29922, 14100, 30101],
       [35262, 20170, 30032, ..., 34146,  3880,  2412],
       ...,
       [ 6096, 30824, 18992, ..., 14345, 34251, 20801],
       [34252, 34709, 17569, ..., 32455,  8648,  6047],
       [ 5379, 35401, 31722, ...,  1585, 32585, 16009]])
In [28]:
data_trial.obsm['pca_knn_distances']
Out[28]:
array([[ 4.984169 ,  5.4866176,  5.5318155, ...,  6.568183 ,  6.570032 ,
         6.5835915],
       [ 8.275161 ,  8.679711 ,  8.974716 , ..., 10.486567 , 10.48813  ,
        10.531118 ],
       [ 4.793777 ,  5.1250052,  5.2052846, ...,  6.0857906,  6.092467 ,
         6.0957317],
       ...,
       [ 7.3680816,  7.414498 ,  7.4873137, ...,  9.415309 ,  9.418577 ,
         9.421834 ],
       [ 8.759402 ,  8.7638645,  9.546374 , ..., 11.836472 , 11.852649 ,
        11.85586  ],
       [ 7.129679 ,  7.265322 ,  7.413504 , ..., 11.326677 , 11.330711 ,
        11.331036 ]], dtype=float32)

Each row corresponds to one cell, listing its neighbors (not including itself) from nearest to farthest: data_trial.obsm['pca_knn_indices'] stores their indices, and data_trial.obsm['pca_knn_distances'] stores the distances.

Clustering and Visualization¶

Now we are ready to cluster the data for cell type detection. pegasus provides 4 clustering algorithms:

  • louvain: Louvain algorithm, using louvain package.
  • leiden: Leiden algorithm, using leidenalg package.
  • spectral_louvain: Spectral Louvain algorithm, which requires Diffusion Map.
  • spectral_leiden: Spectral Leiden algorithm, which requires Diffusion Map.

See this documentation for details.

In this tutorial, we use the Louvain algorithm:

In [29]:
pg.louvain(data_trial)
2022-02-24 14:40:27,030 - pegasus.tools.graph_operations - INFO - Function 'construct_graph' finished in 1.41s.
2022-02-24 14:40:54,890 - pegasus.tools.clustering - INFO - Louvain clustering is done. Get 19 clusters.
2022-02-24 14:40:54,927 - pegasus.tools.clustering - INFO - Function 'louvain' finished in 29.38s.

As a result, the Louvain algorithm finds 19 clusters:

In [30]:
data_trial.obs['louvain_labels'].value_counts()
Out[30]:
1     5623
2     4320
3     3594
4     2889
5     2851
6     2660
7     2227
8     1927
9     1625
10    1432
11    1298
12    1032
13     938
14     897
15     550
16     430
17     407
18     386
19     379
Name: louvain_labels, dtype: int64

We can check each cluster's composition regarding donors via a composition plot:

In [31]:
pg.compo_plot(data_trial, 'louvain_labels', 'Channel')

However, we can see a clear batch effect in the plot: e.g., Clusters 11 and 14 draw most of their cells from Donor 3.

We can see it even more clearly in the FIt-SNE plot (a visualization algorithm we will discuss later):

In [32]:
pg.tsne(data_trial)
Will use momentum during exaggeration phase
Computing input similarities...
Using perplexity, so normalizing input data (to prevent numerical problems)
Using perplexity, not the manually set kernel width.  K (number of nearest neighbors) and sigma (bandwidth) parameters are going to be ignored.
Using ANNOY for knn search, with parameters: n_trees 50 and search_k 4500
Going to allocate memory. N: 35465, K: 90, N*K = 3191850
Building Annoy tree...
Done building tree. Beginning nearest neighbor search... 
parallel (4 threads):
2022-02-24 14:41:46,653 - pegasus.tools.visualization - INFO - Function 'tsne' finished in 50.72s.
In [33]:
pg.scatter(data_trial, attrs=['louvain_labels', 'Channel'], basis='tsne')

Batch Correction¶

Batch effects occur when data samples are generated under different conditions, such as date, lab setting, or equipment. Unless all samples are known to have been generated under similar conditions, one may suspect a batch effect when a visualization shows samples largely isolated from each other.

For this dataset, we need a batch correction step to reduce the batch effect observed in the plot above.

In this tutorial, we use the Harmony algorithm for batch correction. It requires redoing HVG selection, calculating new PCA coordinates, and then applying the correction:

In [34]:
pg.highly_variable_features(data, batch='Channel') 
pg.pca(data)
pca_key = pg.run_harmony(data)
2022-02-24 14:41:48,888 - pegasus.tools.hvf_selection - INFO - Function 'estimate_feature_statistics' finished in 0.66s.
2022-02-24 14:41:48,943 - pegasus.tools.hvf_selection - INFO - 2000 highly variable features have been selected.
2022-02-24 14:41:48,944 - pegasus.tools.hvf_selection - INFO - Function 'highly_variable_features' finished in 0.72s.
2022-02-24 14:41:59,815 - pegasus.tools.preprocessing - INFO - Function 'pca' finished in 10.87s.
2022-02-24 14:42:01,153 - pegasus.tools.batch_correction - INFO - Start integration using Harmony.
	Initialization is completed.
	Completed 1 / 10 iteration(s).
	Completed 2 / 10 iteration(s).
	Completed 3 / 10 iteration(s).
	Completed 4 / 10 iteration(s).
	Completed 5 / 10 iteration(s).
	Completed 6 / 10 iteration(s).
	Completed 7 / 10 iteration(s).
	Completed 8 / 10 iteration(s).
Reach convergence after 8 iteration(s).
2022-02-24 14:42:28,581 - pegasus.tools.batch_correction - INFO - Function 'run_harmony' finished in 28.76s.

The corrected PCA coordinates are stored in data.obsm:

In [35]:
data.obsm['X_pca_harmony'].shape
Out[35]:
(35465, 50)

pca_key is the representation key returned by the run_harmony function, equivalent to the string "pca_harmony". In the following sections, you can use either pca_key or "pca_harmony" for the rep parameter of Pegasus functions whenever applicable.

Repeat Previous Steps on the Corrected Data¶

As the PCA coordinates are changed by batch correction, we need to recalculate nearest neighbors and redo clustering. Don't forget to use the corrected PCA coordinates as the representation:

In [36]:
pg.neighbors(data, rep=pca_key)
pg.louvain(data, rep=pca_key)
2022-02-24 14:42:42,163 - pegasus.tools.nearest_neighbors - INFO - Function 'get_neighbors' finished in 13.56s.
2022-02-24 14:42:43,778 - pegasus.tools.nearest_neighbors - INFO - Function 'calculate_affinity_matrix' finished in 1.61s.
2022-02-24 14:42:45,613 - pegasus.tools.graph_operations - INFO - Function 'construct_graph' finished in 1.83s.
2022-02-24 14:43:08,300 - pegasus.tools.clustering - INFO - Louvain clustering is done. Get 16 clusters.
2022-02-24 14:43:08,335 - pegasus.tools.clustering - INFO - Function 'louvain' finished in 24.56s.

Let's check the composition plot now:

In [37]:
pg.compo_plot(data, 'louvain_labels', 'Channel')

If everything goes properly, you should see that no cluster is dominated by cells from a single donor. Also notice that the Louvain algorithm on the corrected data finds 16 clusters, instead of the original 19.

The FIt-SNE plot is also different:

In [38]:
pg.tsne(data, rep=pca_key)
Will use momentum during exaggeration phase
Computing input similarities...
Using perplexity, so normalizing input data (to prevent numerical problems)
Using perplexity, not the manually set kernel width.  K (number of nearest neighbors) and sigma (bandwidth) parameters are going to be ignored.
Using ANNOY for knn search, with parameters: n_trees 50 and search_k 4500
Going to allocate memory. N: 35465, K: 90, N*K = 3191850
Building Annoy tree...
Done building tree. Beginning nearest neighbor search... 
parallel (4 threads):
Symmetrizing...
Using the given initialization.
Exaggerating Ps by 12.000000
Input similarities computed (sparsity = 0.003936)!
Learning embedding...
Using FIt-SNE approximation.
Iteration 50 (50 iterations in 1.33 seconds), cost 5.705014
Iteration 100 (50 iterations in 1.28 seconds), cost 5.225178
Iteration 150 (50 iterations in 1.11 seconds), cost 5.110843
Iteration 200 (50 iterations in 1.16 seconds), cost 5.040103
Iteration 250 (50 iterations in 1.24 seconds), cost 5.012636
Unexaggerating Ps by 12.000000
Iteration 300 (50 iterations in 1.17 seconds), cost 3.788683
Iteration 350 (50 iterations in 1.15 seconds), cost 3.378922
Iteration 400 (50 iterations in 1.51 seconds), cost 3.164257
Iteration 450 (50 iterations in 2.23 seconds), cost 3.046179
Iteration 500 (50 iterations in 2.80 seconds), cost 2.961952
Iteration 550 (50 iterations in 4.27 seconds), cost 2.902651
Iteration 600 (50 iterations in 4.85 seconds), cost 2.856435
Iteration 650 (50 iterations in 5.39 seconds), cost 2.822261
Iteration 700 (50 iterations in 5.98 seconds), cost 2.792967
Iteration 750 (50 iterations in 6.71 seconds), cost 2.767614
2022-02-24 14:43:56,425 - pegasus.tools.visualization - INFO - Function 'tsne' finished in 47.35s.
In [39]:
pg.scatter(data, attrs=['louvain_labels', 'Channel'], basis='tsne')

You can see that the right-hand-side plot has a much better mixture of cells from different donors.

Visualization¶

tSNE Plot¶

In previous sections, we have seen data visualization using FIt-SNE. FIt-SNE is a fast implementation of the tSNE algorithm, and Pegasus uses it for the tSNE embedding calculation. See details.

UMAP Plot¶

Besides tSNE, pegasus also provides UMAP plotting methods:

  • umap: UMAP plot, using umap-learn package. See details
  • net_umap: Approximated UMAP plot with DNN model based speed up. See details

Below is the UMAP plot of the data using umap method:

In [40]:
pg.umap(data, rep=pca_key)
2022-02-24 14:43:58,291 - pegasus.tools.nearest_neighbors - INFO - Found cached kNN results, no calculation is required.
2022-02-24 14:43:58,292 - pegasus.tools.nearest_neighbors - INFO - Function 'get_neighbors' finished in 0.00s.
UMAP(min_dist=0.5, precomputed_knn=(array([[    0, 32454, 33203, ..., 34421, 4585, 33415],
       [    1, 7658, 32973, ..., 33173, 21503, 20769],
       [    2, 13228, 20787, ..., 27422, 28285, 34507],
       ...,
       [35462, 31460, 28289, ..., 11139, 21229, 20250],
       [35463, 34709, 8206, ..., 26634, 1453, 17529],
       [35464, 35401, 8702, ..., 13853, 32612, 24418]]), array([[ 0.       , 5.1344223, 5.1892424, ..., 5.80048  , 5.834852 ,
         5.835117 ],
       [ 0.       , 7.705061 , 8.047319 , ..., 8.791697 , 8.835941 ,
         8.85679  ],
       [ 0.       , 4.039253 , 4.5694957, ..., 5.1763353, 5.1773987,
         5.2111535],
       ...,
       [ 0.       , 6.6650157, 6.7569427, ..., 7.623494 , 7.6727667,
         7.6997523],
       [ 0.       , 8.1451   , 9.145099 , ..., 10.019779 , 10.028571 ,
        10.075575 ],
       [ 0.       , 5.8484592, 5.908896 , ..., 7.372008 , 7.3747406,
         7.4893894]], dtype=float32), <pegasus.tools.visualization.DummyNNDescent object at 0x7ff4a3e6e1d0>), random_state=0, verbose=True)
Thu Feb 24 14:43:58 2022 Construct fuzzy simplicial set
OMP: Info #273: omp_set_nested routine deprecated, please use omp_set_max_active_levels instead.
Thu Feb 24 14:44:02 2022 Construct embedding
Thu Feb 24 14:44:36 2022 Finished embedding
2022-02-24 14:44:36,035 - pegasus.tools.visualization - INFO - Function 'umap' finished in 37.74s.
In [41]:
pg.scatter(data, attrs=['louvain_labels', 'Channel'], basis='umap')

Differential Expression Analysis¶

With the clusters ready, we can now perform Differential Expression (DE) analysis, which discovers cluster-specific marker genes. For each cluster, it compares cells within the cluster against all others, and finds genes significantly highly expressed (up-regulated) or lowly expressed (down-regulated) in the cluster.

Now use the de_analysis method to run DE analysis. We use the Louvain result here.

In [42]:
pg.de_analysis(data, cluster='louvain_labels')
2022-02-24 14:44:38,340 - pegasus.tools.diff_expr - INFO - CSR matrix is converted to CSC matrix. Time spent = 0.9538s.
2022-02-24 14:44:58,967 - pegasus.tools.diff_expr - INFO - MWU test and AUROC calculation are finished. Time spent = 20.6268s.
2022-02-24 14:44:59,085 - pegasus.tools.diff_expr - INFO - Sufficient statistics are collected. Time spent = 0.1181s.
2022-02-24 14:44:59,194 - pegasus.tools.diff_expr - INFO - Differential expression analysis is finished.
2022-02-24 14:44:59,196 - pegasus.tools.diff_expr - INFO - Function 'de_analysis' finished in 21.81s.

By default, DE analysis runs Mann-Whitney U (MWU) test.

Alternatively, you can also run the following tests by setting their corresponding parameters to True:

  • fisher: Fisher’s exact test.
  • t: Welch’s T test.

The DE analysis result is stored under key "de_res" (by default) in the varm field of data. See documentation for more details.

To load the result in a human-readable format, use markers method:

In [43]:
marker_dict = pg.markers(data)

By default, markers:

  • Sort genes by Area under ROC curve (AUROC) in descending order;
  • Use $\alpha = 0.05$ significance level on q-values for inference.

See documentation for customizing these parameters.
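The two defaults above can be mimicked with plain Python: keep genes whose q-value passes the significance level, then sort by AUROC in descending order. The gene records below are made up for illustration:

```python
# Toy DE records: name, AUROC, and MWU q-value per gene.
genes = [
    {"name": "LTB",  "auroc": 0.75, "qval": 0.0001},
    {"name": "TRAC", "auroc": 0.72, "qval": 0.0001},
    {"name": "XYZ1", "auroc": 0.81, "qval": 0.20},   # fails the q-value cutoff
]
alpha = 0.05
significant = [g for g in genes if g["qval"] < alpha]
significant.sort(key=lambda g: g["auroc"], reverse=True)
print([g["name"] for g in significant])  # ['LTB', 'TRAC']
```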

Let's see the up-regulated genes for Cluster 1, and rank them in descending order with respect to log fold change:

In [44]:
marker_dict['1']['up'].sort_values(by='log2FC', ascending=False)
Out[44]:
log2Mean log2Mean_other log2FC percentage percentage_other percentage_fold_change auroc mwu_U mwu_pval mwu_qval
featurekey
LTB 6.078218 2.750911 3.327308 88.982513 41.377892 2.150485e+00 0.752276 138050716.5 0.000000 0.000000
TRAC 4.584767 1.860424 2.724343 72.988869 29.610968 2.464927e+00 0.715238 131253949.5 0.000000 0.000000
BCL11B 3.803313 1.143754 2.659559 64.610497 20.260496 3.188989e+00 0.731830 134298667.0 0.000000 0.000000
CD3D 4.584280 2.014685 2.569595 75.119240 32.243359 2.329759e+00 0.701153 128669176.5 0.000000 0.000000
CD3E 4.084703 1.814133 2.270570 69.030205 30.365038 2.273345e+00 0.686962 126064947.0 0.000000 0.000000
... ... ... ... ... ... ... ... ... ... ...
AL513008.1 0.001603 0.000000 0.001603 0.031797 0.000000 1.000000e+30 0.500159 91784550.0 0.002321 0.005037
AC141002.1 0.001573 0.000000 0.001573 0.031797 0.000000 1.000000e+30 0.500159 91784550.0 0.002321 0.005037
AP001972.3 0.001547 0.000000 0.001547 0.031797 0.000000 1.000000e+30 0.500159 91784550.0 0.002321 0.005037
AC093797.1 0.001542 0.000000 0.001542 0.031797 0.000000 1.000000e+30 0.500159 91784550.0 0.002321 0.005037
AC023051.1 0.001522 0.000000 0.001522 0.031797 0.000000 1.000000e+30 0.500159 91784550.0 0.002321 0.005037

1327 rows × 10 columns

Among them, TRAC is worth noting: it is a critical marker for T cells.

We can also use a volcano plot to inspect the DE result. Below is such a plot for Cluster 1 with MWU test results (the default):

In [45]:
pg.volcano(data, cluster_id = '1', dpi=200)

The plot above uses the default thresholds: log fold change at $1$ (i.e. fold change at $2$), and q-value at $0.05$. Each point stands for a gene. Red points are significant marker genes: those on the right-hand side are up-regulated in Cluster 1, while those on the left-hand side are down-regulated.
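The significance rule behind the red points can be written as a simple predicate (a hypothetical helper for illustration):

```python
# A gene is marked significant when |log2 fold change| >= 1
# (i.e. fold change >= 2) and its q-value is below 0.05.
def is_significant(log2fc, qval, fc_threshold=1.0, alpha=0.05):
    return abs(log2fc) >= fc_threshold and qval < alpha

print(is_significant(2.72, 0.0))   # True  (up-regulated, e.g. TRAC)
print(is_significant(-1.5, 0.01))  # True  (down-regulated)
print(is_significant(0.4, 0.001))  # False (fold change too small)
```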

We can see that gene TRAC is the second-rightmost point, a significantly up-regulated gene for Cluster 1.

To save a DE analysis result to file, you can use the write_results_to_excel method in Pegasus:

In [46]:
pg.write_results_to_excel(marker_dict, "MantonBM_subset.de.xlsx")
2022-02-24 14:45:41,335 - pegasus.tools.diff_expr - INFO - Excel spreadsheet is written.
2022-02-24 14:45:41,490 - pegasus.tools.diff_expr - INFO - Function 'write_results_to_excel' finished in 18.97s.

Cell Type Annotation¶

With DE analysis done, we can use its results to annotate the clusters.

In [47]:
celltype_dict = pg.infer_cell_types(data, markers = 'human_immune')
cluster_names = pg.infer_cluster_names(celltype_dict)

infer_cell_types has 2 critical parameters to set:

  • markers: Either 'human_immune', 'mouse_immune', 'human_brain', 'mouse_brain', 'human_lung', or a user-specified marker dictionary.
  • de_test: which DE analysis test to use for cell type inference: either 't', 'fisher', or 'mwu'. The default is 'mwu'.

infer_cluster_names by default uses threshold = 0.5 to filter out candidate cell types with scores lower than 0.5.

See documentation for details.
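The threshold rule can be sketched as follows: among a cluster's candidate cell types, drop those scoring below 0.5 and name the cluster after the best remaining candidate. The dictionaries below loosely mimic the celltype_dict report structure and are made up for illustration:

```python
# Toy candidate cell types for one cluster, with prediction scores.
candidates = [
    {"name": "T cell", "score": 1.00},
    {"name": "Natural killer cell", "score": 0.35},  # below threshold
]
threshold = 0.5
passing = [c for c in candidates if c["score"] >= threshold]
cluster_name = passing[0]["name"] if passing else "Unknown"
print(cluster_name)  # T cell
```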

Below is the cell type annotation report for Cluster 1:

In [48]:
celltype_dict['1']
Out[48]:
[name: T cell; score: 1.00; average marker percentage: 65.21%; strong support: (CD3D+,75.12%),(CD3E+,69.03%),(CD3G+,43.72%),(TRAC+,72.99%)]

The report has a list of predicted cell types along with their scores and support genes for users to decide.

Next, substitute the inferred cluster names in data using annotate function:

In [49]:
pg.annotate(data, name='anno', based_on='louvain_labels', anno_dict=cluster_names)
data.obs['anno'].value_counts()
Out[49]:
Naive T cell                   6290
CD14+ Monocyte                 5177
B cell                         4332
T helper cell                  3272
Cytotoxic T cell               2953
Natural killer cell            2600
Cytotoxic T cell-2             2596
Cytotoxic T cell-3             1739
Erythroid cells                1629
Hematopoietic stem cell        1441
Pre B cell                      933
CD14+ Monocyte-2                685
CD1C+ dendritic cell            550
CD16+ Monocyte                  469
Plasma cell                     408
Plasmacytoid dendritic cell     391
Name: anno, dtype: int64

So the cluster-specific cell type information is stored in data.obs['anno'].

The anno_dict can be either a list or a dictionary. If given a list (as is the case here), Pegasus matches cell types with cluster labels in order. Alternatively, you can create an annotation dictionary with cluster labels as keys and cell types as values.

In practice, users may want to manually create this annotation structure by reading the report in celltype_dict. In this tutorial, we'll just use the output of infer_cluster_names function for demonstration.
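The two accepted forms of anno_dict are equivalent, as this toy example shows. With a list, names are matched to cluster labels in order; with a dictionary, the mapping is explicit (labels and names here are illustrative):

```python
# List form: names matched to cluster labels in order.
cluster_labels = ["1", "2", "3"]
names_list = ["T cell", "B cell", "Monocyte"]
mapping_from_list = dict(zip(cluster_labels, names_list))

# Dictionary form: the same mapping written explicitly.
mapping_from_dict = {"1": "T cell", "2": "B cell", "3": "Monocyte"}
print(mapping_from_list == mapping_from_dict)  # True
```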

Now plot the data with cell types:

In [50]:
pg.scatter(data, attrs='anno', basis='tsne', dpi=100)
In [51]:
pg.scatter(data, attrs='anno', basis='umap', legend_loc='on data', dpi=150)

Raw Count vs Log-norm Count¶

Now let's check the count matrix:

In [52]:
data
Out[52]:
MultimodalData object with 1 UnimodalData: 'GRCh38-rna'
    It currently binds to UnimodalData object GRCh38-rna

UnimodalData object with n_obs x n_vars = 35465 x 25653
    Genome: GRCh38; Modality: rna
    It contains 2 matrices: 'X', 'raw.X'
    It currently binds to matrix 'X' as X

    obs: 'n_genes', 'Channel', 'gender', 'n_counts', 'percent_mito', 'scale', 'louvain_labels'(cluster), 'anno'
    var: 'featureid', 'n_cells', 'percent_cells', 'robust', 'highly_variable_features', 'mean', 'var', 'hvf_loess', 'hvf_rank'
    obsm: 'X_pca', 'X_pca_harmony', 'pca_harmony_knn_indices'(knn), 'pca_harmony_knn_distances'(knn), 'X_tsne'(basis), 'X_umap'(basis)
    varm: 'means', 'partial_sum', 'de_res'
    obsp: 'W_pca_harmony'
    uns: 'genome', 'modality', 'df_qcplot', 'norm_count', 'ncells', 'stdzn_mean', 'stdzn_std', 'stdzn_max_value', '_tmp_fmat_highly_variable_features', 'PCs', 'pca', 'pca_features', '_attr2type', 'louvain_resolution'

You can see that besides X, there is another matrix, raw.X, generated during this analysis. As the key name indicates, raw.X stores the raw count matrix as loaded from the original zarr file, while X stores the log-normalized counts.
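
The relationship between raw.X and X can be sketched roughly as follows (a toy illustration, assuming Pegasus's default target of 1e5 counts per cell, the norm_count parameter of pg.log_norm; the real implementation also restricts the per-cell sum to robust genes, while this sketch scales over all genes):

```python
import numpy as np

# Toy raw count matrix: 2 cells x 3 genes.
raw = np.array([[2.0, 0.0, 8.0],
                [1.0, 3.0, 6.0]])

norm_count = 1e5
scale = norm_count / raw.sum(axis=1, keepdims=True)  # per-cell scale factor
lognorm = np.log1p(raw * scale)                      # natural log of (1 + scaled count)
```

Inverting with np.expm1 recovers scaled counts that sum to norm_count in each cell.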

data currently binds to matrix X. To use the raw count instead, type:

In [53]:
data.select_matrix('raw.X')
data
Out[53]:
MultimodalData object with 1 UnimodalData: 'GRCh38-rna'
    It currently binds to UnimodalData object GRCh38-rna

UnimodalData object with n_obs x n_vars = 35465 x 25653
    Genome: GRCh38; Modality: rna
    It contains 2 matrices: 'X', 'raw.X'
    It currently binds to matrix 'raw.X' as X

    obs: 'n_genes', 'Channel', 'gender', 'n_counts', 'percent_mito', 'scale', 'louvain_labels'(cluster), 'anno'
    var: 'featureid', 'n_cells', 'percent_cells', 'robust', 'highly_variable_features', 'mean', 'var', 'hvf_loess', 'hvf_rank'
    obsm: 'X_pca', 'X_pca_harmony', 'pca_harmony_knn_indices'(knn), 'pca_harmony_knn_distances'(knn), 'X_tsne'(basis), 'X_umap'(basis)
    varm: 'means', 'partial_sum', 'de_res'
    obsp: 'W_pca_harmony'
    uns: 'genome', 'modality', 'df_qcplot', 'norm_count', 'ncells', 'stdzn_mean', 'stdzn_std', 'stdzn_max_value', '_tmp_fmat_highly_variable_features', 'PCs', 'pca', 'pca_features', '_attr2type', 'louvain_resolution'

Now data binds to the raw counts.

We still need the log-normalized counts for the following sections, so switch back to the default count matrix:

In [54]:
data.select_matrix('X')

Cell Development Trajectory and Diffusion Map¶

Alternatively, Pegasus provides cell development trajectory plots using the Force-directed Layout (FLE) algorithm:

  • pg.fle: FLE plot, using the ForceAtlas2 algorithm from the forceatlas2-python package. See details
  • pg.net_fle: approximate FLE plot with a deep-neural-network-based speed-up. See details
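
The core idea behind a force-directed layout can be sketched in a few lines of NumPy: nodes joined by an edge attract, all node pairs repel, and positions are updated iteratively. This is only a conceptual toy; Pegasus delegates the real computation to the ForceAtlas2 implementation, run on a kNN graph:

```python
import numpy as np

rng = np.random.default_rng(1)
pos = rng.normal(size=(4, 2))          # 4 nodes embedded in 2-D
edges = [(0, 1), (1, 2), (2, 3)]       # a small chain graph

step = 0.01
for _ in range(100):
    disp = np.zeros_like(pos)
    # Pairwise repulsion, with magnitude ~ 1 / distance
    for i in range(len(pos)):
        for j in range(len(pos)):
            if i != j:
                d = pos[i] - pos[j]
                disp[i] += d / (d @ d + 1e-9)
    # Attraction along edges, with magnitude ~ distance
    for i, j in edges:
        d = pos[j] - pos[i]
        disp[i] += d
        disp[j] -= d
    pos += step * disp
```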

Moreover, the FLE embedding is computed on the Diffusion Map of the data, rather than directly on the data points, for better efficiency. Thus, we first need to compute the diffusion map:

In [55]:
pg.diffmap(data, rep=pca_key)
2022-02-24 14:45:42,484 - pegasus.tools.diffusion_map - INFO - Calculating connected components is done.
2022-02-24 14:45:42,706 - pegasus.tools.diffusion_map - INFO - Calculating normalized affinity matrix is done.
2022-02-24 14:45:49,998 - pegasus.tools.diffusion_map - INFO - Detected knee point at t = 196.
2022-02-24 14:45:50,050 - pegasus.tools.diffusion_map - INFO - Function 'diffmap' finished in 7.63s.

By default, the diffmap method uses:

  • Number of diffusion components = 100
  • The PCA matrix as input.

In this tutorial, we use the Harmony-corrected PCA matrix instead, which is specified by pca_key. The resulting diffusion map is stored in data.obsm with key "X_diffmap":

In [56]:
data.obsm['X_diffmap'].shape
Out[56]:
(35465, 99)
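
Conceptually, a diffusion map embeds cells by eigendecomposing a Markov transition matrix built from cell-cell affinities. A toy NumPy sketch of the idea (not Pegasus's actual implementation, which builds the affinity from a kNN graph so it scales to tens of thousands of cells):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))                       # 20 cells, 5 features

# Gaussian affinity between all cell pairs.
d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
W = np.exp(-d2 / d2.mean())

P = W / W.sum(axis=1, keepdims=True)               # Markov transition matrix
evals, evecs = np.linalg.eig(P)
order = np.argsort(-evals.real)

# The top eigenvector is trivial (constant, eigenvalue 1); the next ones
# give the diffusion coordinates, scaled by their eigenvalues.
diffmap = evecs.real[:, order[1:3]] * evals.real[order[1:3]]
print(diffmap.shape)  # (20, 2)
```

Dropping the trivial component is likely why the stored embedding above has 99 columns for 100 requested components.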

Now we are ready to calculate the pseudo-temporal trajectories of cell development. We use fle here:

In [57]:
pg.fle(data)
2022-02-24 14:45:56,118 - pegasus.tools.nearest_neighbors - INFO - Function 'get_neighbors' finished in 6.05s.
2022-02-24 14:45:56,912 - pegasus.tools.nearest_neighbors - INFO - Function 'calculate_affinity_matrix' finished in 0.79s.
2022-02-24 14:45:57,343 - pegasus.tools.graph_operations - INFO - Function 'construct_graph' finished in 0.43s.
Feb 24, 2022 2:45:59 PM org.netbeans.modules.masterfs.watcher.Watcher getNotifierForPlatform
INFO: Native file watcher is disabled
Feb 24, 2022 2:46:03 PM org.gephi.io.processor.plugin.DefaultProcessor process
INFO: # Nodes loaded: 35,465
Feb 24, 2022 2:46:03 PM org.gephi.io.processor.plugin.DefaultProcessor process
INFO: # Edges loaded: 1,103,392
100 iterations, change_per_node = 285.07505171898
200 iterations, change_per_node = 88.6603748495822
300 iterations, change_per_node = 69.96269455601114
400 iterations, change_per_node = 58.382668471308946
500 iterations, change_per_node = 42.48427053370875
600 iterations, change_per_node = 36.51743123615614
700 iterations, change_per_node = 57.50085077306505
800 iterations, change_per_node = 53.466973813590286
900 iterations, change_per_node = 31.473549128584533
1000 iterations, change_per_node = 26.949127279195555
1100 iterations, change_per_node = 23.89739484695084
1200 iterations, change_per_node = 21.818792608307866
1300 iterations, change_per_node = 16.834493971338812
1400 iterations, change_per_node = 14.549610773797745
1500 iterations, change_per_node = 24.999551478312924
1600 iterations, change_per_node = 14.344507827615091
1700 iterations, change_per_node = 10.028001999019168
1800 iterations, change_per_node = 16.79699630327543
1900 iterations, change_per_node = 14.926362548924274
2000 iterations, change_per_node = 7.4008317506310375
2100 iterations, change_per_node = 11.197200999216168
2200 iterations, change_per_node = 15.44583429980423
2300 iterations, change_per_node = 7.960986359557257
2400 iterations, change_per_node = 7.765516795742545
2500 iterations, change_per_node = 12.014859493203284
2600 iterations, change_per_node = 8.92648900279234
2700 iterations, change_per_node = 7.708658842814037
2800 iterations, change_per_node = 8.555552015473822
2900 iterations, change_per_node = 6.4304069506032455
3000 iterations, change_per_node = 6.1636712753162906
3100 iterations, change_per_node = 4.222439054191193
3200 iterations, change_per_node = 7.340950279476349
3300 iterations, change_per_node = 3.4823769481160305
3400 iterations, change_per_node = 5.625180557013902
3500 iterations, change_per_node = 5.3982330933094955
3600 iterations, change_per_node = 8.0552669377968
3700 iterations, change_per_node = 4.813566311184252
3800 iterations, change_per_node = 3.5346873126043383
3900 iterations, change_per_node = 3.2635326666072233
4000 iterations, change_per_node = 3.8688297195206784
4100 iterations, change_per_node = 5.111444480548418
4200 iterations, change_per_node = 5.791658958482641
4300 iterations, change_per_node = 4.555203157150939
4400 iterations, change_per_node = 3.6941067114838573
4500 iterations, change_per_node = 2.034265639988344
Finished in 4521 iterations, change_per_node = 1.9790443738285401
Time = 652.529s
2022-02-24 14:56:51,625 - pegasus.tools.visualization - INFO - Function 'fle' finished in 661.57s.

And show the FLE plot colored by cell type annotation:

In [58]:
pg.scatter(data, attrs='anno', basis='fle')

Save Result to File¶

Use the write_output function to save the analysis results to a file:

In [59]:
pg.write_output(data, "result.zarr.zip")
2022-02-24 14:56:52,441 - pegasusio.zarr_utils - WARNING - Detected and removed pre-existing file result.zarr.zip.
2022-02-24 14:56:54,577 - pegasusio.readwrite - INFO - zarr.zip file 'result.zarr.zip' is written.
2022-02-24 14:56:54,577 - pegasusio.readwrite - INFO - Function 'write_output' finished in 2.14s.

The result is stored in zarr format, the default file format in Pegasus.

Alternatively, you can save it in h5ad, mtx, or loom format. See the write_output documentation for instructions.

Read More...¶

  • Read Plotting Tutorial for more plotting functions provided by Pegasus.

  • Read Pegasus API documentation for details on Pegasus functions.